A COMPARISON OF THE FIT OF EMPIRICAL DATA TO TWO LATENT TRAIT
MODELS


LEAH R. HUTTEN

UNIVERSITY OF MASSACHUSETTS, AMHERST


LATENT TRAIT THEORY HAS SHOWN GREAT PROMISE FOR SOLVING
A MULTITUDE OF MEASUREMENT PROBLEMS THAT HAVE NOT BEEN HANDLED
ADEQUATELY BY CLASSICAL TEST THEORY. ONE OF THE MOST
IMPORTANT GAINS TO BE MADE USING LATENT TRAIT THEORY IS IN THE
FIELD OF TEST EQUATING. WITH LATENT TRAIT ABILITY ESTIMATES,
IT IS POSSIBLE TO EQUATE TESTS WHICH ARE NOT PARALLEL, AND
WHICH DO NOT EVEN CONTAIN THE SAME NUMBER OF ITEMS. THE
NATIONAL READING TEST EQUATING STUDY (RENTZ AND BASHAW, 1975)
HELPED SPUR INTEREST BY PRACTITIONERS IN LATENT TRAIT ABILITY
ESTIMATION. THEORETICALLY IT IS NOW POSSIBLE TO CONDUCT
EVALUATIVE STUDIES ON SCHOOL CHILDREN WHO HAVE TAKEN DIFFERENT
ACHIEVEMENT TESTS. A SECOND IMPROVEMENT BROUGHT ABOUT THROUGH
THE USE OF LATENT TRAIT MODELS OCCURS IN THE FIELD OF TEST
DEVELOPMENT. HERE, IT IS POSSIBLE TO DESIGN TESTS AT
SPECIFIC DIFFICULTY LEVELS, WHICH CAN BE HIGHLY DISCRIMINATING
WITHIN GIVEN ABILITY RANGES. TESTS CAN BE "TAILORED" TO
STUDENTS' INDIVIDUAL NEEDS.

BECAUSE MAJOR IMPROVEMENTS IN MEASUREMENT ARE EXPECTED
USING LATENT TRAIT THEORY, SCHOOL SYSTEMS AND GOVERNMENT
EDUCATIONAL ORGANIZATIONS AROUND THE COUNTRY HAVE SHOWN
INCREASED INTEREST IN USING LATENT TRAIT MODELS. THIS
INCREASE IN INTEREST IS ALSO ATTRIBUTED TO THE THEORY'S
INCREASING ACCEPTANCE BY THE MEASUREMENT COMMUNITY ITSELF, AND
FINALLY, TO TECHNOLOGICAL ADVANCES IN BOTH LATENT TRAIT
PARAMETER ESTIMATION AND COMPUTER METHODS. ALTHOUGH WE ARE
CURRENTLY WITNESSING THE USE OF LATENT TRAIT MODELS IN A
VARIETY OF APPLIED SETTINGS (SEE, FOR EXAMPLE, HAMBLETON
ET AL., 1979; RENTZ AND RENTZ, 1978), MANY BASIC RESEARCH
QUESTIONS CONCERNING LATENT TRAIT THEORY HAVE NOT YET BEEN
SATISFACTORILY ANSWERED. THE RESEARCH REPORTED IN THIS STUDY
WAS DESIGNED TO PROVIDE NEEDED INFORMATION FOR EFFECTIVE
APPLICATION OF LATENT TRAIT MODELS BY PRACTITIONERS.


PURPOSE 


THE PRIMARY QUESTION ADDRESSED IN THIS STUDY WAS HOW WELL
EMPIRICAL DATA FIT THE ONE AND THREE-PARAMETER LOGISTIC
LATENT TRAIT MODELS, THE MODELS OF MOST CURRENT INTEREST IN THE
MEASUREMENT COMMUNITY. ALTHOUGH THERE ARE MANY CLAIMS THAT
BOTH ACHIEVEMENT AND APTITUDE DATA FIT RASCH (ONE-PARAMETER)
MODELS, AND EQUALLY STRONG CLAIMS CONCERNING FIT OF DATA TO
THE THREE-PARAMETER LOGISTIC MODEL, LITTLE RESEARCH HAS
ADDRESSED THE QUESTION OF COMPARABLE MODEL FIT. THREE
QUESTIONS SEEM ESPECIALLY IMPORTANT:

1. SHOULD THE PRACTITIONER SELECT THE RASCH MODEL WITH ONE
TYPE OF DATA, AND THE BIRNBAUM (THREE-PARAMETER) MODEL FOR
OTHER KINDS OF DATA?

2. IS THERE IMPROVEMENT IN MODEL-DATA FIT FOUND BY USING
THE THREE-PARAMETER MODEL, RATHER THAN THE RASCH MODEL?

3. HOW CAN PRACTITIONERS DETERMINE WHICH TEST MODEL (THE
ONE OR THREE-PARAMETER MODEL) BEST SUITS THEIR DATA?

ANSWERS TO THE ABOVE QUESTIONS HAVE BEEN SOUGHT
PRIMARILY THROUGH SIMULATION STUDIES. THERE IS INSUFFICIENT
EVIDENCE FAVORING ONE OR THE OTHER LATENT TRAIT MODEL FROM
RESEARCH USING EMPIRICAL DATA. WHAT FOLLOWS ARE SOME RESULTS
THAT HAVE BEEN ACCUMULATED CONCERNING MODEL FIT. HAMBLETON
AND TRAUB (1973) COMPARED THE ONE AND TWO-PARAMETER LOGISTIC
MODELS WITH VERBAL AND MATH APTITUDE DATA USING HEURISTIC
ESTIMATES OF LATENT TRAIT ITEM PARAMETERS. IMPROVEMENT IN
FIT, DEFINED BY A CHI SQUARE TEST BASED ON OBSERVED AND
EXPECTED RAW SCORE FREQUENCIES, WAS FOUND FOR THE
TWO-PARAMETER MODEL. A RECENT STUDY BY KOCH AND RECKASE
(1973) EXPLORED THE FIT OF THE ONE AND THREE-PARAMETER
LOGISTIC MODELS FOR APTITUDE AND ACHIEVEMENT TEST DATA USING A
MEAN SQUARE DEVIATION STATISTIC. IN THIS STUDY, THE
THREE-PARAMETER MODEL CONSISTENTLY FIT DATA BETTER THAN THE
ONE-PARAMETER MODEL. UNFORTUNATELY, THE SAMPLING DISTRIBUTION
FOR THE MEAN SQUARE DEVIATION STATISTIC IS UNKNOWN, AND THUS
THE RESULTS OF THIS STUDY HAVE QUESTIONABLE VALIDITY. RENTZ
AND RENTZ (1978) COMPARED THE FIT OF APTITUDE, ACHIEVEMENT,
AND CRITERION REFERENCED TEST DATA TO THE RASCH MODEL, USING
THE WRIGHT AND PANCHAPAKESAN (1969) FIT STATISTIC. IT WAS
REPORTED THAT THE RASCH MODEL ADEQUATELY REPRESENTED THESE
THREE DIVERSE KINDS OF DATA. A COMPARISON OF THE ONE, TWO,
AND THREE-PARAMETER MODELS WAS CONDUCTED BY HAMBLETON AND COOK
(1973) UTILIZING SIMULATED DATA. THIS TECHNIQUE ALLOWED THE
RESEARCHERS TO COMPARE ESTIMATED PARAMETERS TO THE TRUE VALUES
FROM WHICH THE DATA WERE GENERATED. THESE RESEARCHERS FOUND A
SIGNIFICANT IMPROVEMENT BY EMPLOYING THE THREE-PARAMETER
LOGISTIC MODEL, ESPECIALLY WITH SHORT TESTS.

THE RESULTS FROM THIS STUDY PROVIDE AN INDICATION OF THE
ADEQUACY OF LATENT TRAIT THEORY FOR EXPLAINING TEST BEHAVIOR.
THE RESULTS INCLUDE EVIDENCE ON WHICH OF THE ONE OR
THREE-PARAMETER LOGISTIC MODELS BEST SUITS VARIOUS TYPES OF
DATA. HOPEFULLY, THE INFORMATION PROVIDED HERE CAN SERVE AS A
GUIDE FOR PRACTITIONERS IN SELECTING LATENT TRAIT MODELS FOR
USE IN TEST CONSTRUCTION AND TEST ANALYSIS.


RESEARCH  QUESTIONS 


ITEM  DISCRIMINATION  AND  GUESSING 

THE RASCH MODEL IS BASED ON THE PREMISES THAT ITEM
DISCRIMINATION IS EQUAL FOR ALL ITEMS AND THAT GUESSING DOES
NOT OCCUR. TWO QUESTIONS ARISE IN THIS CONNECTION: 1) HOW CAN
ONE DETERMINE IF THESE ASSUMPTIONS ARE FULFILLED IN A DATA
SET? AND 2) CAN DATA FIT THE RASCH MODEL EVEN WHEN THESE
ASSUMPTIONS ARE VIOLATED? IT IS DIFFICULT TO ASSUME THAT
GUESSING DOES NOT TAKE PLACE ON MULTIPLE CHOICE TESTS, AND YET
THE RASCH MODEL IS CONSIDERED ROBUST WITH RESPECT TO THIS
CONDITION (MEAD, 1976). A NUMBER OF PRACTICAL PROCEDURES HAVE
BEEN SUGGESTED TO DETERMINE THE EXTENT OF GUESSING ON ITEMS.
UNFORTUNATELY, MOST METHODS OBSCURE THE POSSIBILITY THAT
GUESSING MAY BE AS MUCH PERSON OR ABILITY RELATED AS ITEM
RELATED (JENSEMA, 1974). IN THIS CASE, NEITHER THE RASCH NOR
THE THREE-PARAMETER MODEL WOULD BE AN ADEQUATE DESCRIPTION OF
TEST BEHAVIOR. PRACTICAL METHODS ARE UTILIZED IN THIS STUDY
TO EXPLORE THE EXTENT OF GUESSING IN A DATA SET.

TWO STRONG POSITIONS ARE TAKEN CONCERNING THE RASCH
MODEL ASSUMPTION OF EQUAL ITEM DISCRIMINATION. BIRNBAUM
(1968), ROSS (1966), AND HAMBLETON AND TRAUB (1973) FOUND
CONSIDERABLE VARIATION IN ITEM DISCRIMINATION FOR EMPIRICAL
DATA. NEVERTHELESS, IN STUDIES OF THE RASCH MODEL, RESULTS
TYPICALLY SHOW THAT THE MODEL IS FAIRLY ROBUST WITH RESPECT TO
VARYING ITEM DISCRIMINATION. FOR EXAMPLE, DINERO AND HAERTEL
(1977) EXPLORED SIMULATED DATA IN WHICH CLASSICAL ITEM
DISCRIMINATION WAS VARIED UP TO .25 VARIANCE. THEY FOUND NO
MAJOR REDUCTION IN FIT TO THE RASCH MODEL. ON THE OTHER HAND,
STUDIES BY HAMBLETON AND COOK (1978) AND BY HAMBLETON AND
TRAUB (1976) FOUND THE OPPOSITE RESULT, ESPECIALLY WHEN THE
RANGE OF VARIATION IN ITEM PARAMETERS WAS LARGE.


THE RANGE OF ITEM DISCRIMINATION CAN BE DETERMINED, TO
AN EXTENT, BY EXAMINING CLASSICAL ITEM DISCRIMINATION
PARAMETERS. THERE ARE NO REAL GUIDELINES AVAILABLE FOR
DETERMINING AT WHAT POINT THE RANGE OF ITEM DISCRIMINATION
PARAMETERS IS TOO GREAT TO FIT ASSUMPTIONS OF THE RASCH MODEL.
THIS POINT IS ADDRESSED IN THE RESULTS AND CONCLUSIONS OF THIS
STUDY.

UNIDIMENSIONALITY 

THE ASSUMPTION THAT DATA ARE UNIDIMENSIONAL IS AN
ASSUMPTION UNDERLYING NEARLY ALL OF THE POPULAR LATENT TRAIT
MODELS. A SINGLE ABILITY, OR LATENT TRAIT, IS ASSUMED TO
UNDERLIE ITEMS IN A TEST. IN PRACTICE, FEW TESTS ARE TRULY
UNIDIMENSIONAL. USING A FACTOR ANALYTIC METHOD, IT IS
CUSTOMARY TO FIND LESS THAN 25% OF A TEST'S TOTAL VARIANCE
ACCOUNTED FOR BY A FIRST, OR GENERAL, FACTOR. HAMBLETON AND
TRAUB (1976) FOUND, WITH ARTIFICIAL DATA, THAT VIOLATION OF
THE ASSUMPTION OF UNIDIMENSIONALITY LED TO POOR FIT FOR DATA
TO THE RASCH MODEL.

A NUMBER OF TESTS FOR UNIDIMENSIONALITY HAVE BEEN
OFFERED BY VARIOUS RESEARCHERS. LUMSDEN (1961) REVIEWED FIVE
METHODS FOR ASSESSING UNIDIMENSIONALITY WITHIN THE CONTEXT OF
TEST DEVELOPMENT, AND CONCLUDED THAT FACTOR ANALYSIS IS THE
MOST PROMISING METHOD. LATENT TRAIT RESEARCHERS HAVE USED
PRINCIPAL COMPONENT ANALYSIS, MAXIMUM LIKELIHOOD FACTOR
ANALYSIS, AND PRINCIPAL AXIS COMMON FACTOR ANALYSIS TO
DETERMINE UNIDIMENSIONALITY IN THEIR DATA. THERE EXISTS SOME
DISAGREEMENT IN THE LITERATURE CONCERNING THE CORRELATION
MATRIX THAT IS MOST APPROPRIATE FOR FACTOR ANALYSIS: PHI
COEFFICIENTS OR TETRACHORICS. THE LATTER REPRESENTS A
MEASURE OF RELATIONSHIP BETWEEN TWO ASSUMED LATENT VARIABLES
SCORED DICHOTOMOUSLY. NOT ONLY DOES THIS ASSUMPTION AGREE
WITH THE PREMISES OF LATENT TRAIT THEORY, BUT ALSO, USING
TETRACHORIC CORRELATIONS IMPROVES THE CHANCES FOR OBTAINING A
FACTOR ANALYTIC SOLUTION. REGARDLESS OF THE STATISTICAL
TECHNIQUE USED TO DETERMINE UNIDIMENSIONALITY, ONE PERPLEXING
PROBLEM REMAINS: DATA CAN BE UNIDIMENSIONAL FOR ONE SAMPLE AND
NOT FOR ANOTHER. CURRENTLY, NO STATISTICAL TECHNIQUE CAN
SOLVE THIS PROBLEM. BOTH THE RASCH AND THREE-PARAMETER MODELS
ARE INVESTIGATED HERE WITH RESPECT TO HOW WELL THEY FIT DATA
OF VARYING DIMENSIONALITY BASED ON A FACTOR ANALYTIC
CRITERION.


SAMPLE  SIZE  AND  TEST  LENGTH 

ONE MAJOR SOURCE OF DISAGREEMENT BETWEEN LATENT TRAIT
THEORISTS CONCERNS THE MINIMUM PERSON AND ITEM SAMPLE SIZES
NEEDED TO OBTAIN CONSISTENT LATENT TRAIT PARAMETER ESTIMATES.
THE LOGIST COMPUTER PROGRAM MANUAL (WOOD, WINGERSKY, AND LORD,
1976) SUGGESTS MINIMUMS OF 40 ITEMS AND 1000 PERSONS. WRIGHT
(1977) CONTENDS THAT SMALL SAMPLES (100 PERSONS) ARE
SUFFICIENT FOR EFFECTIVE ESTIMATION. THIS STUDY EXPLORES FIT
OF SMALL SAMPLE DATA (20 ITEMS, 250 PERSONS) TO THE RASCH AND
THREE-PARAMETER MODELS. A CONSIDERABLY MORE EXTENSIVE STUDY
OF THIS PROBLEM HAS BEEN PREPARED BY SWAMINATHAN AND GIFFORD
(1979).


GOODNESS-OF-FIT 

MANY DEFINITIONS FOR GOODNESS-OF-FIT APPEAR IN THE LATENT
TRAIT LITERATURE (HAMBLETON, 1979). NOT ONLY DO DEFINITIONS
OF FIT VARY FROM AUTHOR TO AUTHOR, BUT METHODS FOR TESTING FIT
OF MODELS TO DATA VARY FROM MODEL TO MODEL. MANY OF THE
STATISTICAL MEASURES EMPLOYED FOR TESTING GOODNESS-OF-FIT ARE
CONSIDERED UNSOUND (BIRNBAUM, 1968). THE CHI SQUARE TEST IS
OFTEN UTILIZED FOR GOODNESS-OF-FIT, THOUGH, GIVEN A SUFFICIENT
SAMPLE SIZE, MOST DATA WILL BE REJECTED BY THIS MEASURE.
NEVERTHELESS, THIS AUTHOR HAS CHOSEN TO EMPLOY CHI SQUARE TEST
STATISTICS IN THIS STUDY. SINCE THE STUDY IS COMPARATIVE IN
NATURE, ONLY RELATIVE FIT NEED BE ASSESSED. IN ADDITION, A
METHOD WAS NEEDED THAT WOULD BE APPROPRIATE TO BOTH MODELS
UNDER STUDY. THE CHI SQUARE STATISTIC MEETS THESE CRITERIA.


METHODOLOGY


DESCRIPTION AND PROCESSING OF TEST DATA

FIVE DATA SETS WERE SELECTED FOR THIS STUDY:

1. CALIFORNIA TEST OF BASIC SKILLS - VOCABULARY SUBTEST,
GRADE 10;

2. CALIFORNIA TEST OF BASIC SKILLS - MATH COMPREHENSION
SUBTEST, GRADE 10;

3. SCHOLASTIC APTITUDE TEST - VERBAL, GRADE 12;

4. STANFORD ACHIEVEMENT TEST - VOCABULARY SUBTEST, GRADE 5;

5. STANFORD ACHIEVEMENT TEST - SCIENCE SUBTEST, GRADE 5.

TESTS WERE SELECTED TO COVER A RANGE OF BOTH CONTENT AND
GRADE LEVELS. TWO LIMITATIONS WERE PLACED ON DATA SELECTION.
FIRST, A MINIMUM SAMPLE SIZE OF 1000 WAS REQUIRED (AT A SINGLE
GRADE LEVEL). SECONDLY, THE MINIMUM NUMBER OF ITEMS IN A TEST
OR SUBTEST WAS FORTY. EACH OF THE TESTS SELECTED FOR STUDY
WAS FOUND TO BE RELATIVELY UNIDIMENSIONAL. IN PILOT ANALYSES,
IT WAS FOUND THAT PARAMETER ESTIMATION FOR DATA WHICH IS NOT
UNIDIMENSIONAL OFTEN DOES NOT REACH CONVERGENCE WITHIN A
REASONABLE TIME LIMIT (400 COMPUTER SECONDS). ANALYSIS OF
DATA SETS THAT DO NOT HAVE A DOMINANT SINGLE FACTOR IS PLANNED
IN THE NEAR FUTURE.

A FLOW CHART DEPICTING THE DESIGN OF THIS STUDY IS
PRESENTED IN FIGURE 1. FOR EACH DATA SET, THE FOLLOWING STEPS
WERE EXECUTED. EACH TEST OR SUBTEST WAS SCORED BY A FORTRAN
PROGRAM. THE TETRACHORIC CORRELATION MATRIX WAS OBTAINED AND
FACTOR ANALYZED USING A PRINCIPAL COMPONENTS SOLUTION.
RESULTS OF THE FACTOR ANALYSIS ARE USED TO CHARACTERIZE DATA
IN TERMS OF DIMENSIONALITY. FOLLOWING THE FACTOR ANALYSIS, A
RANDOM SAMPLE OF 1000 CASES WAS DRAWN FROM THE TOTAL SAMPLE.
THIS SAMPLE WAS RETAINED FOR FURTHER ANALYSIS. CLASSICAL ITEM
ANALYSIS WAS PERFORMED TO CHARACTERIZE TESTS IN TERMS OF
STANDARD TESTING METHODOLOGY AND TO COMPARE CLASSICAL WITH
LATENT TRAIT PARAMETER ESTIMATES. FOR EACH TEST THE AVERAGE,
RANGE, AND CONFIDENCE BAND FOR ITEM-TOTAL CORRELATIONS WERE
CALCULATED TO EXAMINE THE ASSUMPTION OF EQUAL ITEM
DISCRIMINATION. IN ADDITION, CLASSICAL ITEM DIFFICULTIES FOR
THE LOWEST DECILE OF EXAMINEES WERE COMPUTED AS AN INDICATOR
OF GUESSING ON DIFFICULT ITEMS.

- INSERT  FIGURE  1  AROUND  HERE - 


IN THE NEXT PHASE OF THE STUDY, ITEM AND ABILITY
PARAMETERS WERE ESTIMATED UNDER THE ONE AND THREE-PARAMETER
MODELS FOR EIGHT SAMPLING CONDITIONS. TWO SAMPLE SIZES, 250
AND 1000 PERSONS, AND TWO TEST LENGTHS, 20 AND "TOTAL" ITEMS,
WERE USED. SAMPLES OF ITEMS WERE SELECTED BY RANDOM METHODS.
RANDOM SELECTION OF PERSONS UTILIZED A SPACED SAMPLING
TECHNIQUE AFTER VERIFYING THAT THE ORIGINAL SAMPLE OF
EXAMINEES WAS NOT ORDERED. PARAMETER ESTIMATION WAS
ACCOMPLISHED THROUGH THE LOGIST COMPUTER PROGRAM (WOOD,
WINGERSKY, AND LORD, 1975).
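THE SPACED SELECTION OF PERSONS CAN BE ILLUSTRATED WITH A SHORT
PYTHON SKETCH. THE SOURCE DOES NOT GIVE THE EXACT SPACING RULE, SO
THE SKETCH ASSUMES EVENLY SPACED POSITIONS IN THE FILE; THE
FUNCTION NAME AND THE EXAMPLE DATA ARE HYPOTHETICAL.

```python
def spaced_sample(records, n):
    """Pick n records at evenly spaced positions (assumed spacing rule)."""
    if n <= 0 or n > len(records):
        raise ValueError("n must be between 1 and len(records)")
    # Integer arithmetic keeps every index inside the list.
    return [records[(i * len(records)) // n] for i in range(n)]


# Hypothetical file of 1000 examinee identifiers, reduced to 250 persons.
examinees = list(range(1000))
subset = spaced_sample(examinees, 250)
```

SUCH A SELECTION BEHAVES LIKE A RANDOM SAMPLE ONLY WHEN THE FILE
IS NOT ORDERED, WHICH IS WHY THE ORDERING WAS CHECKED FIRST.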

SINCE THE INPUT PARAMETER SET FOR EACH LOGIST EXECUTION
VARIED GREATLY (OVER 50 PARAMETERS CAN BE SPECIFIED), AN
INTERACTIVE TIME-SHARING FORTRAN PROGRAM, LOGPREP, WAS
DESIGNED TO CREATE INPUT FILES. FOR MOST THREE-PARAMETER
MODEL RUNS THE DEFAULT OPTIONS OF LOGIST WERE USED. THE
ONE-PARAMETER MODEL IS ESTIMATED BY FIXING GUESSING AT ZERO
AND ITEM DISCRIMINATION AT ONE. OUTPUTS FROM LOGIST ALONG
WITH THE RAW DATA WERE INPUT INTO A FORTRAN PROGRAM, THETITM,
TO OBTAIN RAW AND EXPECTED RAW SCORES UTILIZING THE
APPROPRIATE ONE OR THREE-PARAMETER ITEM CHARACTERISTIC
FUNCTIONS. THE RAW SCORE IS DEFINED AS:

X_a = \sum_{g=1}^{n} U_{ga}    (2.1)

WHERE U_{ga} = 1 IF THE ITEM IS ANSWERED CORRECTLY AND U_{ga} = 0
OTHERWISE. THE EXPECTED RAW SCORE BASED ON LATENT TRAIT
THEORY IS:

E(X_a) = \sum_{g=1}^{n} P_g(\theta_a)

WHERE P_g(\theta_a) IS THE PROBABILITY OF A CORRECT RESPONSE ON
ITEM G BY PERSONS WITH ABILITY LEVEL \theta_a. TO COMPARE OBSERVED
AND EXPECTED RAW SCORES (UNDER EACH MODEL) IT WAS NECESSARY TO
ROUND EXPECTED RAW SCORES TO THE CLOSEST INTEGER. FINALLY,
EXPECTED AND OBSERVED RAW SCORES AND GROUPED RAW SCORE
FREQUENCIES WERE OBTAINED USING SPSS. AN INTERACTIVE FORTRAN
PROGRAM, CHISQ, WAS USED TO PERFORM CHI SQUARE TESTS FOR EACH
MODEL-SAMPLE-TEST LENGTH COMBINATION. THE CHI SQUARE IS DEFINED
AS:

\chi^2 = \sum_{j} (O_j - E_j)^2 / E_j

WHERE O STANDS FOR THE OBSERVED FREQUENCY AND E INDICATES EXPECTED
FREQUENCY IN SCORE GROUP J.
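THE EXPECTED RAW SCORE AND CHI SQUARE COMPUTATIONS DESCRIBED ABOVE
CAN BE SKETCHED IN PYTHON. THIS IS A MINIMAL ILLUSTRATION, NOT THE
THETITM OR CHISQ PROGRAMS: THE LOGISTIC FORM WITH SCALING CONSTANT
D = 1.7 IS THE USUAL BIRNBAUM FORMULATION AND IS ASSUMED HERE, AND
ALL ITEM PARAMETERS AND FREQUENCIES ARE HYPOTHETICAL.

```python
import math

D = 1.7  # scaling constant; assumed, not stated in the source


def icc(theta, b, a=1.0, c=0.0):
    """Three-parameter logistic item characteristic function.

    With a = 1 and c = 0 it reduces to the one-parameter case,
    matching the way the text fixes discrimination and guessing.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def expected_raw_score(theta, items):
    """Sum of P_g(theta) over items given as (b, a, c) tuples."""
    return sum(icc(theta, b, a, c) for b, a, c in items)


def chi_square(observed, expected):
    """Sum over score groups of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical three-item test: (difficulty, discrimination, guessing).
items = [(-1.0, 1.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
score = expected_raw_score(0.0, items)

# Hypothetical grouped raw score frequencies for three score groups.
chi2 = chi_square([10, 20, 30], [12.0, 18.0, 30.0])
```

IN THE STUDY THE EXPECTED SCORE IS ROUNDED TO THE CLOSEST INTEGER
BEFORE FREQUENCIES ARE GROUPED; THAT STEP IS LEFT OUT OF THE SKETCH.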

TO ASSESS THE INFLUENCE OF SAMPLE SIZE ON ESTIMATING
ITEM PARAMETERS, AN ADDITIONAL LOGIST RUN WAS EXECUTED UNDER
THE ASSUMPTIONS OF EACH MODEL. THESE RUNS USED ABILITY
ESTIMATES (THETA) FROM THE 1000-PERSON SAMPLE AND RECOMPUTED
ITEM PARAMETERS ON A SMALL SAMPLE OF 250 PERSONS. ANALYSIS OF
ITEM PARAMETERS WAS THEN ACCOMPLISHED USING THE ADAPT
INTERACTIVE STATISTICAL PACKAGE (A TIME-SHARING, APL-BASED
STATISTICAL ANALYSIS PACKAGE). ANALYSIS INCLUDED PEARSON AND
SPEARMAN CORRELATIONS BETWEEN SMALL AND LARGE SAMPLE
PARAMETERS UNDER THE TWO MODELS, AND IN ADDITION, THE AVERAGE
ABSOLUTE DIFFERENCE BETWEEN SMALL AND LARGE SAMPLE PARAMETERS
UNDER THE TWO MODELS WAS OBTAINED. A SIMILAR PROCEDURE WAS
UTILIZED TO ANALYZE ABILITY ESTIMATES FROM SHORT AND LONG
TESTS. IN THIS CASE, ITEM PARAMETERS FOR 20 ITEMS (FROM THE
OVERALL "TOTAL" TEST LENGTH ANALYSIS) WERE USED, AND ABILITY
ESTIMATES WERE RECOMPUTED FOR THE SHORT TEST UNDER THE ONE AND
THREE-PARAMETER MODEL ASSUMPTIONS. THE RESULTING PARAMETER
ESTIMATES WERE ANALYZED, AS ABOVE, WITH THE ADAPT STATISTICAL
SYSTEM.

FOR EACH LOGIST COMPUTER ESTIMATION, COST WAS TALLIED.
THE TWO MODELS ARE EXPLORED IN TERMS OF THEIR COMPUTER COSTS.
COSTS ARE PRESENTED FOR EACH TEST AND FOR VARIOUS ITEM AND
PERSON SAMPLE SIZES.

BECAUSE A NUMBER OF THE RESULTS OF THIS STUDY CONFLICTED
WITH THE PREDICTIONS DERIVED FROM THE THEORY OF LATENT TRAITS,
ADDITIONAL ANALYSES WERE MADE TO CHECK THE RESULTS. FOUR
ADDITIONAL LOGIST ESTIMATIONS WERE EXECUTED ON THE SCHOLASTIC
APTITUDE VERBAL SUBTEST. IN EACH CASE A TWENTY (20) ITEM
SUBSET OF DATA WAS USED. ONE SUBSET WAS DESIGNED SUCH THAT
THE ITEM DISCRIMINATION PARAMETERS WERE EQUAL (A .03 RANGE
AROUND THE MEAN POINT-BISERIAL). A SECOND SUBTEST WAS
DESIGNED SO THAT THERE RESULTED UNEQUAL ITEM DISCRIMINATIONS
(OUTSIDE OF A .1 RANGE ABOUT THE MEAN POINT-BISERIAL).
ANALYSES WERE THEN PERFORMED ON THESE DATA TO COMPARE THE ONE
AND THREE-PARAMETER MODELS.


RESULTS 


FIT OF THE ONE AND THREE-PARAMETER LOGISTIC MODELS


FOR EACH OF THE FIVE DATA SETS, THE EXPECTED RAW SCORE
DISTRIBUTION FIT THE OBSERVED RAW SCORE DISTRIBUTION BETTER
FOR THE ONE-PARAMETER MODEL THAN FOR THE THREE-PARAMETER
MODEL. CHI SQUARE STATISTICS, AVERAGED ACROSS FIVE TESTS, ARE
PRESENTED IN TABLE 2. CHI SQUARE STATISTICS FOR EACH
INDIVIDUAL TEST ARE PRESENTED IN TABLE 3. THE CHI SQUARE
STATISTICS FOR SMALLER SAMPLE SIZES ARE LESS IN MAGNITUDE, AS
ONE WOULD EXPECT, ALTHOUGH THERE WERE SOME CONFLICTING RESULTS
IN THE DATA. FOR THE ONE-PARAMETER MODEL, THE SHORT TESTS
YIELDED BETTER FITS. THE OPPOSITE RESULT HOLDS FOR THE
THREE-PARAMETER MODEL. THE DIFFERENCE IN MAGNITUDES FOR THE
CHI SQUARES IN TABLE 3 MIGHT BE ATTRIBUTED TO THE WAY IN WHICH
THE SCORES WERE GROUPED, ESPECIALLY FOR THE LONG TEST WHICH
CONTAINED A VARYING TOTAL NUMBER OF ITEMS. SCORES WERE
USUALLY GROUPED INTO SIX CATEGORIES, BUT IN SOME INSTANCES THE
LOWEST RAW SCORE GROUP HAD FREQUENCIES TOO LOW FOR COMPUTING
THE CHI SQUARE STATISTIC. IN THIS CASE, THE LOWEST TWO SCORE
GROUPS WERE COMBINED. ON 20-ITEM TESTS, THE FIRST CATEGORY
INCLUDED SCORES 1 THROUGH 4, WHEREAS ALL OTHER CATEGORIES
CONTAINED 3 SCORES. ON LONGER TESTS, FIVE OR MORE RAW SCORES
COMPOSED EACH GROUPING, WITH THE EXCEPTION OF THE LOWEST AND
HIGHEST SCORE GROUPS. THESE CONTAINED FROM SIX TO TWELVE RAW
SCORES. ON ANY GIVEN TEST THE GROUPINGS WERE CONSTANT.

- INSERT TABLES 2 AND 3 AROUND HERE -


THE VERY HIGH CHI SQUARE STATISTICS CAN ALMOST ALWAYS BE
ATTRIBUTED TO LACK OF FIT IN THE LOWEST SCORE GROUPING. THIS
EFFECT WAS ESPECIALLY NOTICEABLE FOR THE THREE-PARAMETER MODEL
DATA. EVEN WITH THIS SCORE CATEGORY OMITTED, BETTER FIT WAS
FOUND FOR THE ONE-PARAMETER MODEL. AN EXCEPTION TO THIS TREND
WAS FOUND FOR THE SCIENCE SUBTEST OF THE STANFORD ACHIEVEMENT
TEST. HERE, THE FIT TO BOTH MODELS WAS EQUAL. IT SHOULD ALSO
BE NOTED THAT THE CRITERION FOR FIT IN THIS STUDY, THE RAW
SCORE, IS A SUFFICIENT STATISTIC FOR THE RASCH MODEL, BUT NOT
FOR THE THREE-PARAMETER MODEL. THE RESULTS NEED TO BE
CONSIDERED IN VIEW OF THIS FACT.

ITEM DISCRIMINATION, GUESSING, AND UNIDIMENSIONALITY

IT IS IMPOSSIBLE TO OBTAIN AN EXPECTED RAW SCORE OF ZERO WITH
THE THREE-PARAMETER MODEL IF ANY GUESSING OCCURS. ALTHOUGH
LOGIST WAS FAIRLY ACCURATE IN ESTIMATING GUESSING FOR ITEMS
FALLING AT THE EXTREMES (NO GUESSING OR MUCH GUESSING),
GENERALLY THE GUESSING PARAMETERS WERE UNESTIMABLE. THE
ESTIMATION PROCEDURE SETS THE GUESSING PARAMETER TO THE
QUANTITY (1/NCH - .05) AT THE OUTSET OF ESTIMATION, WHERE NCH
IS THE NUMBER OF MULTIPLE CHOICE ALTERNATIVES. IF ESTIMATION
OF OTHER PARAMETERS IS STABLE, GUESSING IS ALLOWED TO VARY.
THIS WAS NOT USUALLY THE CASE FOR THIS DATA. THE FOLLOWING
ARE APPROXIMATE LOWER BOUNDS FOR EXPECTED RAW SCORES UNDER
THE THREE-PARAMETER MODEL FOR EACH OF THE FIVE TESTS:

SCHOLASTIC APTITUDE VERBAL = 12.75
CALIFORNIA MATH COMPREHENSION = 7.2
CALIFORNIA VOCABULARY = 6.0
STANFORD VOCABULARY = 10.0
STANFORD SCIENCE = 12.0

(THESE LOWER BOUNDS ARE COMPUTED USING THE NUMBER OF ITEMS
AND NUMBER OF CHOICES). ALTHOUGH SOME OF THE POOR FIT FOR THE
THREE-PARAMETER MODEL CAN BE ATTRIBUTED TO THE LOWEST SCORE
GROUP, THE RESULTS WERE STILL RATHER SURPRISING. TWO
POSSIBLE EXPLANATIONS EXIST. ONE POSTULATE IS THAT THE DATA
CHOSEN FOR STUDY ARE ALL ONE-PARAMETER DATA. A SECOND
EXPLANATION IS THAT THERE MAY BE SOME DIFFICULTY IN ESTIMATING
PARAMETERS FOR THE THREE-PARAMETER MODEL BECAUSE OF THE
ADDITIONAL NUMBER OF UNKNOWN QUANTITIES THAT NEED TO BE
ESTIMATED. THE RESULTS ARE MOST LIKELY A COMBINATION OF THESE
TWO EXPLANATIONS.
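THE LOWER BOUNDS LISTED ABOVE FOLLOW FROM THE FACT THAT, AS ABILITY
DECREASES, EACH THREE-PARAMETER ITEM PROBABILITY APPROACHES ITS
GUESSING VALUE, SO THE EXPECTED RAW SCORE CANNOT FALL BELOW THE SUM
OF THE GUESSING VALUES. THE PYTHON SKETCH BELOW ASSUMES EVERY ITEM
KEEPS THE STARTING VALUE (1/NCH - .05). THE ITEM COUNT AND NUMBER
OF CHOICES ARE HYPOTHETICAL, SINCE THE TEXT DOES NOT LIST THEM;
THEY WERE PICKED BECAUSE THEY REPRODUCE A BOUND OF 12.75.

```python
def lower_bound(n_items, n_choices):
    """Smallest expected raw score when each item's guessing value
    stays at the starting quantity (1/NCH - .05)."""
    return n_items * (1.0 / n_choices - 0.05)


# Hypothetical test: 85 items with five choices per item (assumed values).
bound = lower_bound(85, 5)
```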


BOTH GUESSING AND ITEM DISCRIMINATION WERE FURTHER
INVESTIGATED TO DETERMINE WHETHER THEY HAD BEEN PROPERLY
ESTIMATED. TABLE 4 PRESENTS SOME RESULTS CONCERNING THE
GUESSING PARAMETER. THE EXTENT OF GUESSING ON EACH TEST WAS
DETERMINED BY CALCULATING CLASSICAL ITEM DIFFICULTIES FOR THE
25% MOST DIFFICULT ITEMS FOR THE LOWEST DECILE OF EXAMINEES
BASED ON THE SAMPLE (RAW SCORE CRITERION). ON THIS CRITERION,
EACH TEST WAS RATED FOR THE PERCENT OF GUESSING BEHAVIOR
DISPLAYED ON HARD ITEMS BY LOW ABILITY EXAMINEES. LATENT
TRAIT GUESSING ESTIMATES WERE COMPARED TO THESE VALUES. THE
LAST COLUMN OF TABLE 5 INDICATES HOW OFTEN LATENT TRAIT AND
CLASSICAL PARAMETERS WERE IN CONCORDANCE, WHICH WAS DEFINED
AS THE NUMBER OF TIMES THAT HIGH LATENT TRAIT GUESSING
ESTIMATES MATCHED HIGH GUESSING ESTIMATES USING CLASSICAL TEST
THEORY INDICATORS. WITH THE EXCEPTION OF THE CALIFORNIA
VOCABULARY SUBTEST (WHICH WAS THE SHORTEST AND MOST DIFFICULT
TEST), LOGIST WAS QUITE ACCURATE IN PINPOINTING ITEMS AT
EITHER EXTREME (MINIMAL OR MAXIMUM GUESSING). GENERALLY
THOUGH, THE GUESSING PARAMETER WAS OVERESTIMATED. ALTHOUGH
THIS OVERESTIMATION CLEARLY AFFECTED THE LOWEST SCORE GROUP,
IN GENERAL, THE EFFECTS OF THIS OVERESTIMATION WERE NOT FOUND
ACROSS THE ABILITY DISTRIBUTION. THUS, THE LESS ADEQUATE FIT
OF THE DATA TO THE THREE-PARAMETER MODEL CANNOT BE ATTRIBUTED
SOLELY TO OVERESTIMATION OF THE GUESSING PARAMETER.
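THE CLASSICAL GUESSING INDICATOR DESCRIBED IN THIS PARAGRAPH CAN BE
SKETCHED IN PYTHON AS FOLLOWS. THE SKETCH ASSUMES A MATRIX OF 0/1
ITEM SCORES WITH ONE ROW PER EXAMINEE; THE FUNCTION NAME, THE
TIE-BREAKING BY FILE ORDER, AND THE EXAMPLE DATA ARE HYPOTHETICAL.

```python
def guessing_indicator(responses):
    """Proportion correct in the lowest raw score decile, reported
    for the 25% most difficult items (by full-sample difficulty)."""
    n_persons = len(responses)
    n_items = len(responses[0])
    # Classical difficulty: proportion correct in the whole sample.
    p_all = [sum(row[g] for row in responses) / n_persons
             for g in range(n_items)]
    # Hardest quarter of the items (lowest proportion correct).
    hardest = sorted(range(n_items), key=lambda g: p_all[g])
    hardest = hardest[: max(1, n_items // 4)]
    # Lowest decile of examinees by raw score.
    low = sorted(responses, key=sum)[: max(1, n_persons // 10)]
    return {g: sum(row[g] for row in low) / len(low) for g in hardest}


# Hypothetical data: nine examinees miss only item 3; one answers only item 3.
responses = [[1, 1, 1, 0]] * 9 + [[0, 0, 0, 1]]
indicator = guessing_indicator(responses)
```

VALUES NEAR THE CHANCE LEVEL (ONE OVER THE NUMBER OF CHOICES) WOULD
SUGGEST GUESSING BY LOW ABILITY EXAMINEES ON HARD ITEMS.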

- INSERT TABLE 4 AROUND HERE -


TABLES 5 AND 6 PRESENT RESULTS CONCERNING THE ITEM
DISCRIMINATION PARAMETER. THESE RESULTS ARE BASED ON 20-ITEM
TESTS CONSTRUCTED TO HAVE VERY DIFFERENT OR VERY SIMILAR ITEM
DISCRIMINATIONS (BY CLASSICAL ITEM INDICATORS). IN TABLE 5
CHI SQUARE STATISTICS ARE COMPUTED FOR SIX SCORE GROUPS. IN
THIS TABLE WE FIND THAT WHEN THE ITEM DISCRIMINATION
PARAMETERS ARE VERY DIFFERENT, THE THREE-PARAMETER MODEL FITS
THE DATA BETTER THAN THE ONE-PARAMETER MODEL. FOR THE CASE OF
EQUAL ITEM DISCRIMINATION, THE ONE-PARAMETER MODEL SHOWS
BETTER FIT. REGARDLESS OF THE WAY IN WHICH SCORES WERE
GROUPED, THE SAME CHI SQUARE TREND WAS FOUND. FROM THESE
RESULTS, IT SEEMS PLAUSIBLE TO CONCLUDE THAT ALL OF THE DATA
SETS USED IN THIS STUDY HAVE EQUAL ITEM DISCRIMINATIONS. THE
AVERAGE CLASSICAL ITEM-TOTAL CORRELATION (POINT-BISERIAL) IS
GIVEN FOR EACH DATA SET IN TABLE 5. THE SECOND COLUMN OF THE
TABLE SHOWS THE PERCENT OF ITEMS THAT FALL WITHIN THE
CONFIDENCE BAND OF THE MEAN POINT-BISERIAL PLUS OR MINUS .1.
GENERALLY, THE MAJORITY OF CLASSICAL POINT-BISERIALS ARE QUITE
CLOSE IN MAGNITUDE. IT IS SUGGESTED THAT WHEN THE ITEM
DISCRIMINATIONS ARE TRULY EQUIVALENT, THE THREE-PARAMETER
ESTIMATION PROCEDURE MAY PRODUCE INCONSISTENT ESTIMATES FOR
ITEM DISCRIMINATION. RESEARCH CONCURRENT WITH THIS
(SWAMINATHAN AND GIFFORD, 1979) HAS INDICATED THAT ITEM
DISCRIMINATION TENDS TO BE OVERESTIMATED BY THE MAXIMUM
LIKELIHOOD PROCEDURE.
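THE BAND CRITERION DESCRIBED ABOVE (PERCENT OF ITEMS WITHIN .1 OF
THE MEAN POINT-BISERIAL) CAN BE SKETCHED IN PYTHON. THE FUNCTION
NAME AND THE FIVE POINT-BISERIAL VALUES ARE HYPOTHETICAL AND SERVE
ONLY TO SHOW THE ARITHMETIC.

```python
def share_within_band(point_biserials, half_width=0.1):
    """Return the mean point-biserial and the share of items whose
    value lies within half_width of that mean."""
    mean_r = sum(point_biserials) / len(point_biserials)
    inside = [r for r in point_biserials if abs(r - mean_r) <= half_width]
    return mean_r, len(inside) / len(point_biserials)


# Hypothetical item-total correlations for a five-item test.
mean_r, share = share_within_band([0.30, 0.35, 0.40, 0.45, 0.70])
```

A SHARE CLOSE TO ONE WOULD POINT TOWARD ROUGHLY EQUAL ITEM
DISCRIMINATIONS, THE CONDITION UNDER WHICH THE ONE-PARAMETER MODEL
FIT BETTER IN THIS STUDY.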

- INSERT  TABLES  5  AND  6  AROUND  HERE - 


IT WAS IMPOSSIBLE TO DETERMINE THE INTERRELATIONSHIP
BETWEEN GUESSING, ITEM DISCRIMINATION, AND MODEL FIT FOR
SPECIFIC DATA SETS IN THIS STUDY. THE TWO SUBTESTS ON WHICH
EXAMINEES SHOWED THE MOST GUESSING ALSO HAD THE NARROWEST
RANGE OF ITEM DISCRIMINATIONS. ONE OF THESE, THE STANFORD
VOCABULARY, SHOWED CLOSE FIT TO THE RASCH MODEL, AND GOOD FIT
TO THE THREE-PARAMETER MODEL AS WELL. THE OTHER, STANFORD
SCIENCE, WAS THE SINGLE TEST THAT FIT THE THREE-PARAMETER
MODEL AS WELL AS THE RASCH MODEL.

AN EXPLANATION OF MODEL FIT IN TERMS OF UNIDIMENSIONALITY
IN THIS STUDY IS CONFOUNDED BY THE FACT THAT TESTS DIFFERED IN
BOTH LENGTH AND DIFFICULTY. IT CAN BE SAID, HOWEVER, THAT THE
STANFORD VOCABULARY SUBTEST FIT BOTH MODELS BETTER THAN THE
OTHER TESTS, ALTHOUGH THIS TEST WAS NOT THE MOST
UNIDIMENSIONAL. TABLE 6 CHARACTERIZES DIMENSIONALITY OF TESTS
IN TERMS OF THE FIRST LATENT ROOT FROM THE PRINCIPAL COMPONENT
ANALYSIS, AND SHOWS THE VARIANCE ACCOUNTED FOR BY THE FIRST
FACTOR. BY THESE CRITERIA, THE TEST WHICH BEST MEETS THE
ASSUMPTION OF UNIDIMENSIONALITY IS THE CALIFORNIA MATH TEST.
THIS TEST IS ALSO THE EASIEST TEST IN TERMS OF AVERAGE
CLASSICAL ITEM DIFFICULTIES. THE RESULTS SHOW THAT THIS TEST
FIT BOTH MODELS QUITE WELL. THE CHI SQUARE STATISTIC FOR
RASCH MODEL FIT WAS 1.02, THE SECOND BEST FIT FOUND IN THE
STUDY.
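THE TWO DIMENSIONALITY INDICATORS NAMED ABOVE, THE FIRST LATENT
ROOT AND THE SHARE OF VARIANCE IT ACCOUNTS FOR, CAN BE SKETCHED IN
PYTHON. POWER ITERATION IS USED HERE ONLY FOR ILLUSTRATION AND IS
NOT CLAIMED TO BE THE METHOD OF THE STUDY; THE SKETCH ASSUMES A
SYMMETRIC CORRELATION MATRIX WITH UNIT DIAGONAL AND MOSTLY POSITIVE
CORRELATIONS, AND THE EXAMPLE MATRIX IS HYPOTHETICAL.

```python
def first_root(corr, iterations=500):
    """Largest latent root of a correlation matrix by power iteration,
    plus the proportion of total variance it accounts for."""
    n = len(corr)
    v = [1.0] * n  # starting vector; assumes it is not orthogonal to the first component
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the root for the unit-length vector v.
    w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
    root = sum(a * b for a, b in zip(v, w))
    # The trace of a correlation matrix equals the number of items.
    return root, root / n


# Hypothetical three-item matrix with all inter-item correlations at .5.
corr = [[1.0, 0.5, 0.5],
        [0.5, 1.0, 0.5],
        [0.5, 0.5, 1.0]]
root, proportion = first_root(corr)
```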


SAMPLE SIZE

TABLE 7 PROVIDES DATA ON THE ACCURACY OF PARAMETER
ESTIMATION FOR SMALL SAMPLES (N=250). THE RESULTS ARE
AVERAGED ACROSS THE FIVE TESTS. PEARSON PRODUCT MOMENT
CORRELATIONS, SPEARMAN RANK ORDER CORRELATIONS, AND AVERAGE
ABSOLUTE DIFFERENCES BETWEEN PARAMETERS ESTIMATED WITH THE
1000 PERSON AND 250 PERSON SAMPLES ARE GIVEN. ALL ESTIMATES
WERE FIRST STANDARDIZED TO MEAN ZERO TO OBTAIN THESE RESULTS.
ESTIMATES FOR DIFFICULTY ARE QUITE ACCURATE IN THE SMALL
SAMPLE FOR BOTH MODELS. THE SMALL SAMPLE ESTIMATE FOR
GUESSING, ALTHOUGH CLOSE IN MAGNITUDE TO THE LARGE SAMPLE
ESTIMATE, HAD A LOW CORRELATION WITH THE LARGER SAMPLE
ESTIMATE. IT IS APPARENT FROM THIS DATA THAT 250 PERSONS MAY
NOT BE A SUFFICIENT SAMPLE SIZE UPON WHICH TO ESTIMATE
GUESSING. IN FACT, EVEN IN THE 1000-PERSON SAMPLE, THE
MAJORITY OF GUESSING PARAMETERS FOR THIS DATA REMAINED
UNESTIMATED BY THE MAXIMUM LIKELIHOOD METHOD. ESTIMATION OF
ITEM DISCRIMINATION IN THE 250 PERSON SAMPLE IS RELATIVELY
CONSISTENT WITH THE 1000 PERSON ESTIMATE. BUT, BY THE AVERAGE
ABSOLUTE DEVIATION CRITERION, THIS SMALL SAMPLE ESTIMATE
FARED LESS WELL THAN EITHER GUESSING OR DIFFICULTY. IT
APPEARS THAT WHEN DISCRIMINATION IS POORLY ESTIMATED, ALL
OTHER ESTIMATES ARE AFFECTED. THEREFORE, THE DIFFICULTY
PARAMETERS IN THE THREE-PARAMETER CASE DO NOT APPEAR TO BE
ESTIMATED AS EFFECTIVELY WITH SMALL SAMPLES AS IN THE
ONE-PARAMETER CASE.
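THE THREE AGREEMENT INDICES USED IN THIS SECTION (PEARSON
CORRELATION, SPEARMAN RANK ORDER CORRELATION, AND AVERAGE ABSOLUTE
DIFFERENCE AFTER CENTERING AT MEAN ZERO) CAN BE SKETCHED IN PYTHON.
THIS IS NOT THE ADAPT PACKAGE; THE RANKING HELPER ASSUMES NO TIED
VALUES, AND THE PARAMETER ESTIMATES IN THE EXAMPLE ARE HYPOTHETICAL.

```python
import math


def pearson(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(x):
    """Ranks 1..n; assumes no ties (a simplification)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r


def center(x):
    """Shift a set of estimates to mean zero."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def agreement(small, large):
    """Pearson r, Spearman rho, and average absolute difference."""
    s, l = center(small), center(large)
    aad = sum(abs(a - b) for a, b in zip(s, l)) / len(s)
    return pearson(s, l), pearson(ranks(s), ranks(l)), aad


# Hypothetical difficulty estimates from a small and a large sample.
r, rho, aad = agreement([0.1, 0.5, -0.3, 1.2], [0.0, 0.6, -0.4, 1.0])
```

A HIGH CORRELATION WITH A LARGE AVERAGE ABSOLUTE DIFFERENCE, OR THE
REVERSE, IS POSSIBLE, WHICH IS WHY THE INDICES ARE REPORTED TOGETHER.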


- INSERT  TABLE  7  AROUND  HERE - 

TEST  LENGTH 

TEST LENGTH WAS EXAMINED TO DETERMINE WHETHER LATENT
TRAIT THEORY CAN BE APPLIED TO SHORT TESTS (20 ITEMS). TABLE
8 PRESENTS THE RESULTS OF THIS ANALYSIS IN TERMS OF PEARSON
AND SPEARMAN CORRELATIONS, AND AVERAGE ABSOLUTE DIFFERENCES
BETWEEN SHORT AND LONG TESTS, AVERAGED ACROSS FIVE DATA SETS.
FOR BOTH MODELS, ESTIMATES OF ABILITY FROM THE SHORT TEST WERE
REASONABLY CONSISTENT WITH ESTIMATES DERIVED FROM THE LONGER
TESTS. HERE, AS BEFORE, MORE CONSISTENCY WAS FOUND FOR THE
ONE-PARAMETER MODEL.


- INSERT TABLE 8 AROUND HERE -

COSTS 

IN  ADDITION  TO  FINDING  IMPROVEMENT  IN  FIT  FOR  THE 
ONE-PARAMETER  MODEL  BY  STATISTICAL  CRITERIA,  THE  DATA  IN  T  A  E 
9  DEMONSTRATE  THAT  THE  COSTS  OF  ESTIMATING  RASCH  PARAMETER 
VALUES  ARE  CONSIDERABLY  LESS  THAN  THOSE  FOR  HE 
THREE-PARAMETER  MOOEL.  THE  COSTS  SHOWN  IN  TABLE  9  ARE 
AVERAGED  ACROSS  FIVE  TESTS.  THIS  TABLE  ALSO  SHOWS  THE 
RELATIONSHIP  BETWEEN  COMPUTER  COSTS  FOR  LATENT  TRAIT  ESTIMATES 
AND  THE  NUM8ER  OF  PERSONS  AND  ITEMS  ESTIMATED.  THESE  COSTS 
ARE  3ASED  ON  A  CHARGE  OF  £  403  PER  HOUR.  THEY  DO  NOT  REFLECT 


AUXILIARY  COSTS  (DISC  STORAGE,  MAGNETIC  TAPES,  DATA 
PREPARATION,  ETC.).  ALL  OF  THE  FIGURES  IN  TABLE  9  ARE  BASED 
ON  EXECUTIONS  OF  LOGIST  IN  WHICH  PEPSON  AND  ITEMS  ARE 
ESTIMATED  SIMULTANEOUSLY.  TA3LES  10  AND  11  SHED  DIFFERENT 
LIGHT  ON  THE  COSTS  OF  THE  ONE  AND  T HREE - ? AR AME T E R  MODELS. 

TABLE  10  INDICATES  COMPUTER  COSTS  AVERAGED  OVER  FIVE  20-ITEM 
TESTS  WHEN  ITEM  PARAMETERS  ARE  KNOWN.  THERE  IS  ESSENTIALLY  NO 
DIFFERENCE  8ETWEEN  THE  COSTS  OF  ESTIMATING  ABILITY  FOR  THE  ONE 

and  three-parameter  mooels.  since  this  is  the  usual  manner  in 

WHICH  LATENT  TRAIT  THEORY  IS  APPLIED,  THIS  EQUIVALENCE  OF 
COSTS  SHOULD  BE  NOTED  BY  PRACTITIONERS  PLANNING  TO  USE  THESE 
MODELS.  TABLE  11  GIVES  COMPUTER  COSTS  FOR  LOGIST  RUNS
AVERAGED  ACROSS  FIVE  TESTS  FOR  ESTIMATING  ITEM  PARAMETERS  ON
SAMPLES  OF  250  PERSONS  WHEN  ABILITY  IS  KNOWN.  THE  COSTS
GIVEN  FOR  THIS  STUDY  CAN  ONLY  BE  GENERALIZED  TO  THE  LOGIST 
COMPUTER  PROGRAM  AND  DO  NOT  APPLY  TO  COMPARISONS  WITH  OTHER
ESTIMATION  ROUTINES.  IF  THE  ONE-PARAMETER  ESTIMATION  HAD
BEEN  EXECUTED  ON  THE  BICAL  COMPUTER  PROGRAM  (WRIGHT  AND  MEAD,
1976),  THE  COMPUTER  COSTS  FOR  THE  ONE-PARAMETER  MODEL  WOULD
HAVE  BEEN  CONSIDERABLY  LESS.  IN  THE  BICAL  PROCEDURE  ONE
EQUATION  IS  NEEDED  FOR  EACH  RAW  SCORE  CATEGORY,  WHEREAS  IN  THE
MAXIMUM  LIKELIHOOD  METHOD,  SEPARATE  EQUATIONS  ARE  NEEDED  FOR
EACH  EXAMINEE. 
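
THE  REASON  ONE  EQUATION  PER  RAW  SCORE  CATEGORY  CAN  SUFFICE  IN
THE  ONE-PARAMETER  CASE  IS  THAT,  WITH  ITEM  DIFFICULTIES  FIXED,
THE  RASCH  LIKELIHOOD  EQUATION  FOR  ABILITY  SETS  THE  SUM  OF  THE
ITEM  SUCCESS  PROBABILITIES  EQUAL  TO  THE  RAW  SCORE,  SO  ALL
EXAMINEES  WITH  THE  SAME  RAW  SCORE  RECEIVE  THE  SAME  ESTIMATE.
THE  PYTHON  SKETCH  BELOW  ILLUSTRATES  THIS  POINT  ONLY.  IT  IS  NOT
THE  BICAL  OR  LOGIST  CODE,  AND  THE  FUNCTION  NAMES,  THE
NEWTON-RAPHSON  SOLVER  AND  THE  ITEM  DIFFICULTIES  ARE  ASSUMPTIONS
MADE  FOR  ILLUSTRATION.

```python
from math import exp


def rasch_prob(theta, b):
    """Rasch probability of a correct response (ability theta, difficulty b)."""
    return 1.0 / (1.0 + exp(-(theta - b)))


def ability_for_raw_score(r, difficulties, tol=1e-8, max_iter=100):
    """Solve sum_i P_i(theta) = r for theta by Newton-Raphson.

    Only the raw score r enters the equation, so one solution per raw
    score category covers every examinee with that score.
    """
    n = len(difficulties)
    if not 0 < r < n:
        raise ValueError("no finite estimate for zero or perfect scores")
    theta = 0.0
    for _ in range(max_iter):
        probs = [rasch_prob(theta, b) for b in difficulties]
        expected = sum(probs)                    # expected raw score at theta
        info = sum(p * (1.0 - p) for p in probs)  # test information at theta
        step = (r - expected) / info
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical item difficulties (not estimates from the study).
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

# One equation per raw score category: 4 solutions for a 5-item test,
# however many examinees were tested.
score_table = {r: ability_for_raw_score(r, difficulties)
               for r in range(1, len(difficulties))}

for r, theta in score_table.items():
    print(r, round(theta, 3))
```

WITH  UNEQUAL  DISCRIMINATIONS  OR  A  GUESSING  PARAMETER,  THE
LIKELIHOOD  EQUATION  DEPENDS  ON  WHICH  ITEMS  WERE  ANSWERED
CORRECTLY,  SO  THIS  SHORTCUT  IS  NOT  AVAILABLE.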

TABLE  12  HIGHLIGHTS  COSTS  FOR  EACH  SUBTEST.  THERE  IS  A
RELATIONSHIP  BETWEEN  THE  NUMBER  OF  ITEMS  IN  A  TEST  AND  ITS 
COST,  BUT  THE  HIGHER  COSTS  FOR  SOME  SUBTESTS  CAN  ALSO  BE 
ATTRIBUTED  TO  A  LOWER  DEGREE  OF  UNIDIMENSIONALITY.

- INSERT  TABLES  9,  10,  11  AND  12  AROUND  HERE. -


SUMMARY  AND  CONCLUSIONS 


THE  RESULTS  OF  THIS  STUDY  INDICATE  THAT  FOR  DATA  HAVING
ITEMS  EQUAL  IN  DISCRIMINATION,  THE  RASCH  MODEL  PROVIDES  BETTER
FIT  TO  EMPIRICAL  DATA  THAN  THE  THREE-PARAMETER  LOGISTIC  MODEL.
A  PRACTICAL  METHOD  FOR  DETERMINING  EQUALITY  OF  ITEM
DISCRIMINATION,  USING  CLASSICAL  POINT-BISERIALS,  WAS
SUGGESTED.  IT  WAS  ALSO  NOTED  THAT  THE  MAXIMUM  LIKELIHOOD
ESTIMATE  OF  THE  DISCRIMINATION  PARAMETER  MAY  BE  INADEQUATE  AT
THIS  TIME.  AS  IMPROVEMENTS  ARE  MADE  IN  THE  THREE-PARAMETER
ESTIMATION  METHODS,  A  MORE  SENSITIVE  ESTIMATE  OF  THIS
PARAMETER  MAY  BE  FOUND.

ALTHOUGH  THE  DATA  USED  IN  THIS  STUDY  WERE  MULTIPLE  CHOICE
IN  NATURE,  VIOLATION  OF  THE  "NO  GUESSING"  ASSUMPTION  OF  THE
RASCH  MODEL  DID  NOT  APPEAR  TO  AFFECT  FIT  OF  THE  ONE-PARAMETER
MODEL  TO  DATA.  THE  MAXIMUM  LIKELIHOOD  PROCEDURE  TENDED  TO
OVERESTIMATE  GUESSING  FOR  THESE  DATA.  THIS  CAUSED  REDUCED
MODEL-DATA  FIT  OF  THE  THREE-PARAMETER  MODEL,  ESPECIALLY  IN  THE
LOWER  ABILITY  RANGE.  GENERALLY,  GUESSING  WAS  UNESTIMABLE  FOR
THESE  DATA.  UNFORTUNATELY,  NO  ALTERNATIVE  CRITERIA  COULD  BE
FOUND  FOR  ESTIMATING  THE  TRUE  AMOUNT  OF  GUESSING.  BECAUSE
GUESSING  AND  DISCRIMINATION  WERE  CONFOUNDED  IN  THE  DATA,  IT
WAS  IMPOSSIBLE  TO  DETERMINE  WHETHER  THE  GUESSING  PARAMETER
MIGHT  HAVE  IMPROVED  FIT  IN  THE  THREE-PARAMETER  CASE.

EMPIRICAL  DATA,  SUCH  AS  OPEN-ENDED  TEST  QUESTIONS,  IN  WHICH
GUESSING  IS  IMPROBABLE,  ARE  NEEDED  TO  COMPARE  FIT  OF  THE  ONE
AND  THREE-PARAMETER  MODELS.  RESEARCH  INTO  THIS  AREA  MIGHT
BEST  BE  CONDUCTED  THROUGH  STUDIES  USING  SIMULATED  DATA.  WITH
ARTIFICIAL  DATA,  FACTORS  SUCH  AS  THOSE  CONFOUNDING  THE
CURRENT  RESEARCH  COULD  BE  CONTROLLED.  BETTER  ESTIMATES  ARE
NEEDED  FOR  BOTH  ITEM  DISCRIMINATION  AND  GUESSING  IF  THE
THREE-PARAMETER  MODEL  IS  TO  BE  USED  EFFECTIVELY.
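
A  SIMULATED-DATA  STUDY  OF  THE  KIND  SUGGESTED  ABOVE  MIGHT  BEGIN
BY  GENERATING  ITEM  RESPONSES  FROM  THE  THREE-PARAMETER  LOGISTIC
MODEL  WITH  KNOWN  DISCRIMINATION  (A),  DIFFICULTY  (B)  AND
GUESSING  (C)  VALUES,  SO  THAT  GUESSING  AND  DISCRIMINATION  CAN  BE
VARIED  INDEPENDENTLY.  THE  PYTHON  SKETCH  BELOW  IS  ONE  POSSIBLE
STARTING  POINT,  NOT  A  DESIGN  TAKEN  FROM  THIS  STUDY.  THE
SCALING  CONSTANT  D  =  1.7,  THE  FUNCTION  NAMES,  THE  SEED  AND  ALL
PARAMETER  VALUES  ARE  ASSUMPTIONS.

```python
import random
from math import exp

D = 1.7  # conventional scaling constant; assumed, not taken from the paper


def three_pl_prob(theta, a, b, c):
    """Three-parameter logistic probability of a correct response.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    With c = 0 and a common to all items this has the Rasch form.
    """
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


def simulate_responses(abilities, items, seed=0):
    """Return a persons-by-items matrix of 0/1 responses.

    items is a list of (a, b, c) tuples; a fixed seed makes runs repeatable.
    """
    rng = random.Random(seed)
    return [
        [1 if rng.random() < three_pl_prob(theta, a, b, c) else 0
         for (a, b, c) in items]
        for theta in abilities
    ]


# Hypothetical design: same items with and without guessing.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
items_no_guessing = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
items_guessing = [(a, b, 0.2) for (a, b, _) in items_no_guessing]

data_no_guessing = simulate_responses(abilities, items_no_guessing, seed=1)
data_guessing = simulate_responses(abilities, items_guessing, seed=1)
print(data_no_guessing)
print(data_guessing)
```

BECAUSE  THE  GENERATING  PARAMETERS  ARE  KNOWN,  ESTIMATES  FROM  ANY
PROGRAM  CAN  BE  COMPARED  DIRECTLY  WITH  THE  TRUE  VALUES.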

USING  A  FACTOR  ANALYTIC  CRITERION,  THE  DATA  USED  IN  THIS
STUDY  WERE  ALL  FOUND  TO  HAVE  ONE  GENERAL  FACTOR  WHICH,  IN  ALL
CASES,  ACCOUNTED  FOR  MORE  THAN  20  PERCENT  OF  THE  TEST
VARIANCE.  THE  DATA  INDICATE  THAT  THE  MORE  CLOSELY  A  DATA  SET
MEETS  THE  ASSUMPTION  OF  UNIDIMENSIONALITY,  THE  LESS  TIME  THE
LOGIST  PROGRAM  TAKES  TO  CONVERGE  TO  A  SOLUTION.  THERE  ALSO
APPEARED  TO  BE  SOME  IMPROVEMENT  OF  FIT  TO  BOTH  MODELS  FOR
DATA  THAT  SHOWED  EXTREMELY  STRONG  FIRST  FACTOR  VARIANCE.  MORE
RESEARCH  IN  THIS  AREA  IS  NEEDED  WITH  DATA  SETS  THAT  CLEARLY
VIOLATE  THE  ASSUMPTION  OF  UNIDIMENSIONALITY.  IN  ADDITION,
CRITERIA  OTHER  THAN  FACTOR  ANALYSIS  ARE  NEEDED  FOR  DETERMINING
THE  EXTENT  OF  DIMENSIONALITY  IN  DATA.

ALTHOUGH  THE  ABILITY  ESTIMATES  FROM  SHORT  TESTS  WERE
REASONABLY  GOOD,  ITEM  ESTIMATES  FROM  SMALL  SAMPLES  OF  PERSONS 
TENDED  NOT  TO  BE  SO  GOOD.  THIS  RESULT  WAS  ESPECIALLY  APPARENT 
IN  ESTIMATING  ITEM  DISCRIMINATION  FROM  SMALL  SAMPLES. 

WHEN  THE  LOGIST  PROGRAM  IS  USED  WITH  KNOWN  ITEM 
PARAMETERS,  THE  COST  OF  ESTIMATION  IN  THE  ONE  AND 
THREE-PARAMETER  CASES  IS  EQUIVALENT.  IN  ESTIMATING  ITEM 
PARAMETERS  SIMULTANEOUSLY  WITH  ABILITY,  THE  SAVINGS  FOUND  BY 
USING  THE  ONE-PARAMETER  MODEL  ARE  CONSIDERABLE.  IT  IS
DIFFICULT  TO  COMMENT  ON  THIS  COST  DIFFERENTIAL  UNTIL  IT  IS 
DETERMINED  WHETHER  THERE  ARE  OTHER  SUBSTANTIAL  GAINS  TO  BE 
FOUND  WITH  THE  THREE-PARAMETER  MODEL. 

IN  SUMMARY,  USING  COSTS  AND  FIT  TO  TEST  SCORE
DISTRIBUTIONS  AS  CRITERIA,  THE  RASCH  MODEL  WAS  CLEARLY
SUPERIOR  TO  THE  THREE-PARAMETER  LOGISTIC  MODEL  IN  FIT  TO
EMPIRICAL  DATA.  IT  IS  IMPORTANT  TO  POINT  OUT  THAT  OTHER
CRITERIA  FOR  FIT  MIGHT  HAVE  BEEN  SELECTED  WHICH  WOULD  HAVE
SHOWN  BETTER  FIT  FOR  THE  THREE-PARAMETER  MODEL.  FOR  EXAMPLE,
IF  A  WEIGHTED  RAW  SCORE  HAD  BEEN  UTILIZED,  RATHER  THAN  THE
SIMPLE  RAW  SCORE,  IMPROVEMENT  OF  FIT  FOR  THE  THREE-PARAMETER
MODEL  MIGHT  HAVE  BEEN  SEEN.  THE  RESULTS  ALSO  SHOW  THAT  IN  THE
CASE  WHEN  ITEM  DISCRIMINATIONS  ARE  QUITE  DISSIMILAR,  THE
THREE-PARAMETER  MODEL  DEMONSTRATED  FIT  SUPERIOR  TO  THAT  OF  THE
RASCH  MODEL.  RESEARCH  IS  NEEDED  TO  DETERMINE  HOW  UNEQUAL  ITEM
DISCRIMINATIONS  NEED  TO  BE  FOR  THE  THREE-PARAMETER  MODEL  TO
BECOME  MORE  EFFECTIVE.  HERE  AGAIN  A  SIMULATED-DATA  STUDY,
SIMILAR  TO  THE  ONE  PROJECTED  ABOVE  FOR  GUESSING,  IS  NEEDED  IN
CONJUNCTION  WITH  REFINING  THE  ESTIMATION  PROCEDURES.
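
THE  DISTINCTION  BETWEEN  A  SIMPLE  AND  A  WEIGHTED  RAW  SCORE  CAN
BE  SHOWN  WITH  A  SHORT  PYTHON  SKETCH.  THE  PAPER  DOES  NOT  STATE
WHICH  WEIGHTS  WOULD  BE  USED.  THE  SKETCH  ASSUMES  THE  WEIGHTS  ARE
ITEM  DISCRIMINATION  VALUES,  AND  THE  VALUES  AND  RESPONSE
PATTERNS  SHOWN  ARE  HYPOTHETICAL.

```python
def simple_raw_score(responses):
    """Number of items answered correctly (responses are 0 or 1)."""
    return sum(responses)


def weighted_raw_score(responses, weights):
    """Sum of item weights over the items answered correctly."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have equal length")
    return sum(w * u for w, u in zip(weights, responses))


# Hypothetical discrimination values used as weights (an assumption).
weights = [0.5, 1.0, 1.5, 2.0]

# Two hypothetical examinees with the same simple raw score.
examinee_a = [1, 1, 0, 0]
examinee_b = [0, 0, 1, 1]

print(simple_raw_score(examinee_a), weighted_raw_score(examinee_a, weights))
print(simple_raw_score(examinee_b), weighted_raw_score(examinee_b, weights))
```

THE  TWO  EXAMINEES  ARE  INDISTINGUISHABLE  ON  THE  SIMPLE  RAW  SCORE
BUT  DIFFER  ON  THE  WEIGHTED  SCORE,  WHICH  IS  WHY  THE  CHOICE  OF
SCORE  CAN  CHANGE  HOW  WELL  EACH  MODEL  APPEARS  TO  FIT.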

FINALLY,  IT  IS  IMPORTANT  TO  POINT  OUT  THAT  THE
CONCLUSIONS  DRAWN  IN  THIS  PAPER  ARE  TENTATIVE.  THE  PROJECT  IS
IN  MIDSTREAM:  ONLY  HALF  OF  THE  PROJECTED  DATA  SETS  HAVE  BEEN
ANALYZED  TO  DATE. 